Speech synthesis based on cricothyroid and cricoid modeling

ABSTRACT

A sound source generation device includes a calculating element that outputs at least fundamental frequency as a sound source generating parameter according to a command concerning prosody, and a sound source generator. Not only an accent command but also a descent command is used for calculating the fundamental frequency, and a rhythm command, representable by a sine wave, may further be incorporated. A speech synthesis device additionally includes a character string analyzer for analyzing a character string and generating commands concerning phoneme and prosody, and an articulator that operates according to the phoneme command.

FIELD OF THE INVENTION

This invention relates to speech synthesis and speech analysis, and more particularly to a sound source generator, speech synthesizer, and speech synthesizing system and method having improved versatility and precision of sound source generation.

BACKGROUND

The production of speech consists of a combination of three elements: generation of a sound source, articulation by the vocal tract, and radiation from the lips and nostrils. By simplifying these elements and separating sound source and articulation, a generation model of the speech waveform can be represented.

Generally, speech has two characteristics. One, relating to articulation, is the phonemic characteristic, which is mainly shown in the change patterns of the spectrum envelope of the sound. The other, relating to the sound source, is the prosody characteristic, which is mainly shown in the fundamental frequency patterns of the sound.

In speech synthesis based on text data, the information required for synthesizing the phonemic characteristic can be obtained from the text data by using morphological analysis. In contrast, the waveform of fundamental frequency required for synthesizing the prosody characteristic is not shown in the text data. Therefore, this waveform must be obtained according to the accent pattern of a word, the syntax of a sentence, the discourse structure of sentences, and so on.

The Fujisaki model is one of the well-known models for the generation of fundamental frequency. A premise of this model is that the contour of fundamental frequency remains nearly constant, regardless of the overall fundamental frequency, when the pattern of time curves of fundamental frequency is expressed on a logarithmic scale. Further, the model assumes that the fundamental frequency pattern actually observed is represented by the sum of the phrase component, which falls moderately from the beginning through the end of the phrase, and the accent component, which indicates the accent on each word. From this assumption, the two components are approximated by the responses of a second-order critically damped linear system to an impulse phrase command and a step accent command, respectively.

As described above, based on the word's accent pattern, the syntax of a sentence, and the discourse structure of sentences, the phrase command and the accent command are calculated, from which fundamental frequency can then be determined.

However, the above model for the generation of fundamental frequency has the problem that the fundamental frequency cannot be controlled precisely, because only rises in fundamental frequency are taken into consideration. In other words, there is a limit to how much expression can be added to synthesized speech sound. Another problem is that the phrase command and the accent command cannot be obtained reliably when analyzing an observed fundamental frequency pattern.

Another problem is that a time lag occurs between the timing of designating the phrase command and the timing when the phrase component actually appears, because the phrase component is regarded as the response of a second-order critically damped linear system to the impulsive phrase command.

SUMMARY OF THE INVENTION

An object of the present invention is to provide speech synthesis and sound source generation capable of solving the problems of the prior art and capable of adding various expressions, and further to provide speech analysis capable of analyzing fundamental frequency precisely.

The sound source generation device is characterized in that the device comprises: a calculating component for sound source generating parameters, for outputting at least fundamental frequency as sound source generating parameters, upon receiving the command concerning prosody and according to the said command; and a sound source generating component for generating a sound source, upon receiving sound source generating parameters from the calculating component for sound source generating parameters and according to the said sound source generating parameters; wherein not only the accent command but also the descent command is given for calculating fundamental frequency, and the calculating component for sound source generating parameters calculates sound source generating parameters according to the accent command and the descent command.

The sound source generation device is further characterized in that the rhythm command is further given for calculating fundamental frequency, and the calculating component for sound source generating parameters calculates sound source generating parameters according to the accent command, the descent command, and the rhythm command.

The sound source generation device is further characterized in that the rhythm command is represented with a sine wave.

The sound source generation device is further characterized by controlling the characteristic of the generated sound source by means of controlling the amplitude and cycle of the said sine wave.

The speech synthesis device is characterized in that the device comprises: a character string analyzing component for analyzing a given character string and generating the command concerning phoneme and the command concerning prosody; a calculating component for sound source generating parameters, for outputting at least fundamental frequency as sound source generating parameters, upon receiving the command concerning prosody generated by the character string analyzing component and according to the said command; a sound source generating component for generating a sound source, upon receiving sound source generating parameters from the calculating component for sound source generating parameters and according to the said sound source generating parameters; and an articulation component for articulating the sound source from the sound source generating component according to the command concerning phoneme received from the character string analyzing component; wherein the character string analyzing component generates not only the accent command but also the descent command as the command concerning prosody, and the calculating component for sound source generating parameters calculates fundamental frequency according to the accent command and the descent command.

The speech synthesis device is further characterized in that the character string analyzing component further generates the rhythm command as the command concerning prosody, and the calculating component for sound source generating parameters calculates fundamental frequency according to the accent command, the descent command, and the rhythm command.

The speech synthesis device is further characterized in that the calculating component for sound source generating parameters generates the rhythm command as a sine wave.

The speech synthesis device is further characterized in that the calculating component for sound source generating parameters controls the characteristic of the synthesized speech sound generated, by means of controlling the amplitude and cycle of the said sine wave.

The speech processing method is characterized by adopting not only the accent command but also the descent command as elements for controlling fundamental frequency, in any speech processing method using at least fundamental frequency as a parameter. The term “speech processing” here refers to any operation that processes speech, characteristics concerning speech sound, or parameters, including speech synthesis, sound source generation, speech analysis, and fundamental frequency generation therefor.

The speech processing method is further characterized by further adopting the rhythm command as an element for controlling fundamental frequency.

The speech analyzing method is characterized by carrying out analysis using not only the accent command but also the descent command as elements for analyzing fundamental frequency.

The speech analyzing method is further characterized by further adopting the rhythm command as an element for analyzing fundamental frequency.

The storing medium is a computer-readable storing medium for storing programs which are executable by using a computer, for executing any device or method of the present invention by using a computer. The phrase “programs executable by using a computer” here refers to programs stored on the said storing medium that are directly executable, including programs which are executable after being decompressed from a compressed form. It also includes the case of execution in combination with other programs, such as an operating system and libraries. The term “storing medium” refers to a medium for storing programs, such as a floppy disk, a CD-ROM, a hard disk, and so on.

In an embodiment of the present invention, the sound source generation device, the speech synthesis device, and the speech processing method are characterized by adopting not only the accent command but also the descent command as elements for controlling fundamental frequency. Thus, according to this embodiment of the present invention, fundamental frequency is controlled more precisely, and more expressive sound source generation and speech synthesis is implemented.

In an embodiment of the present invention, the sound source generation device, the speech synthesis device, and the speech processing method are characterized by further adopting the rhythm command as an element for controlling fundamental frequency. Thus, with this embodiment, fundamental frequency is controlled more precisely, and more expressive sound source generation and speech synthesis is implemented.

In an embodiment of the present invention, the speech analyzing method is characterized by carrying out analysis using not only the accent command but also the descent command as elements for analyzing fundamental frequency. Thus, with this embodiment, speech characteristics are more precisely analyzable.

In an embodiment of the present invention, the speech analyzing method is characterized by further adopting the rhythm command as an element for analyzing fundamental frequency. Thus, speech characteristics are more precisely analyzable.

Additional objects, advantages, and novel features of the invention will be set forth in part in the description that follows, and in part will become more apparent to those skilled in the art upon examination of the following or upon learning by practice of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In the figures:

FIG. 1A shows an overall configuration of the speech synthesis device as an embodiment of the present invention;

FIG. 1B shows an overall configuration of the speech synthesis device as another embodiment of the present invention;

FIG. 2 shows a hardware configuration using a CPU for an embodiment of the device shown in FIG. 1;

FIG. 3 is a flow of a method for an embodiment of the present invention;

FIG. 4A shows the contents of a word dictionary for an embodiment of the present invention;

FIG. 4B shows the contents of a dictionary of syllable duration for an embodiment of the present invention;

FIG. 4C shows the result of syllable analysis for an embodiment of the present invention;

FIG. 4D shows the contents of a dictionary of voiced/unvoiced sounds of consonants/vowels for an embodiment of the present invention;

FIG. 4E shows the contents of a dictionary of amplitude for each phoneme or syllable for an embodiment of the present invention;

FIG. 4F shows the contents of a phoneme dictionary for an embodiment of the present invention;

FIG. 5 shows a schematic diagram of the accent value, the descent value, and the calculated fundamental frequency for an embodiment of the present invention;

FIG. 6A and FIG. 6B show schematic diagrams of fundamental frequency generation for another embodiment of the present invention;

FIG. 7 shows a calculated voiced sound source amplitude Av and unvoiced sound source amplitude Af for an embodiment of the present invention;

FIG. 8 describes the function of the sound source generating component according to an embodiment of the present invention;

FIG. 9 shows a sound source generated for an embodiment of the present invention;

FIG. 10 shows the sound source after articulation for an embodiment of the present invention;

FIG. 11 shows the model of the larynx for an embodiment of the present invention;

FIG. 12 shows the mechanism of raising fundamental frequency for an embodiment of the present invention;

FIG. 13 shows the mechanism of lowering fundamental frequency for an embodiment of the present invention;

FIG. 14 shows the larynx model using springs according to an embodiment of the present invention;

FIG. 15 shows a force τ1, a force τ2, and fundamental frequency for an embodiment of the present invention;

FIG. 16 shows a dotted pitch pattern for an embodiment of the present invention;

FIG. 17 shows τ1−τ2 for an embodiment of the present invention;

FIG. 18 is a straight line indicating the inclination of fundamental frequency for an embodiment of the present invention;

FIG. 19 shows input parameters for the fundamental frequency generation model according to an embodiment of the present invention;

FIG. 20 shows calculated results and the sampling data of fundamental frequency for an embodiment of the present invention;

FIG. 21 shows a speech utterance by a male for an embodiment of the present invention;

FIG. 22 shows a speech utterance by a male for an embodiment of the present invention;

FIG. 23 shows a speech utterance by a female for an embodiment of the present invention;

FIG. 24 shows a speech utterance by a female for an embodiment of the present invention; and

FIG. 25 shows calculated results and the sampling data for the Osaka dialect according to an embodiment of the present invention.

DETAILED DESCRIPTION

Fundamental Frequency Generation Model

In order to describe an embodiment of the sound source generation device of the present invention, a description of the fundamental frequency generation model used for the device is necessary. This model is as follows.

The calculation model of fundamental frequency is based on the assumption that the movements of muscles and bones in the larynx area can be approximated by the counteraction of two movements of a mechanism of vocal folds stretch and contraction. These physiological movements are converted into a simplified model, and the model is then converted into a mathematical expression that can be controlled with given parameters.

Relationship between Fundamental Frequency and Vocal Folds

Regarding the vocal folds as a spring, the relation between vibration frequency (f0) and tension (T) is

$$f_0 = a\sqrt{T}$$

Experimentally, it has been proven that the relation between a muscle's tension and stretch (x) is described by the following equation:

$$\frac{dT}{dx} = bT + c,$$

where a, b, and c are constants. Given the initial condition T = 0 at x = 0, this equation is solved as follows:

$$T = \frac{c}{b}\left(e^{bx} - 1\right) \approx \frac{c}{b}\,e^{bx} \quad \text{as} \quad e^{bx} \gg 1$$

According to the above two equations,

$$\ln f_0 = \frac{b}{2}x + \ln\!\left(a\sqrt{\frac{c}{b}}\right) = c_1 x + c_2, \qquad c_1 = \frac{b}{2}, \quad c_2 = \ln\!\left(a\sqrt{\frac{c}{b}}\right) \tag{1}$$

is derived.

It is clear from the above equations that the logarithm of fundamental frequency is proportional to the extension of the vocal folds (x). Thus, the fundamental frequency can be controlled if the model can represent how the vocal folds extension is affected by movements of muscles and bones in the larynx area.
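To make the relation in equation (1) concrete, the following is a minimal numeric sketch; the constants c1 and f_default (= exp(c2)) are hypothetical values assumed for illustration, not values given in the text.

```python
import numpy as np

c1 = 8.0           # hypothetical slope b/2 from equation (1)
f_default = 120.0  # hypothetical exp(c2), baseline frequency in Hz

def f0_from_stretch(x):
    """Equation (1) exponentiated: f0 = exp(c1*x + c2)."""
    return f_default * np.exp(c1 * x)

for x in (0.0, 0.05, 0.10):
    print(f"x = {x:.2f} -> f0 = {f0_from_stretch(x):.1f} Hz")
```

A small increase in stretch thus multiplies the fundamental frequency by a fixed factor, which is the property the model exploits.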

Physiological Mechanism of the Larynx Movements

FIG. 11 shows a schematic model of the muscles and cartilage around the vocal folds area. The thyroid cartilage and the cricoid cartilage are connected by the cricothyroid muscle and the vocal folds.

Analyzing the movements around the vocal folds area during speech production, the following three movements of muscles and bones are considered to have the greatest effect on vocal folds stretch and contraction: 1) when the cricothyroid muscle contracts, the thyroid cartilage rotates, and the vocal folds are stretched (see FIG. 12); 2) when the sternohyoid muscle lowers the larynx along the vertebrae, the cricoid cartilage rotates, and the vocal folds contract (see FIG. 13); and 3) when the muscle between the hyoid bone, which is situated in front of the thyroid cartilage, and the thyroid cartilage contracts, the thyroid cartilage rotates, and the vocal folds are stretched.

The first and third movements are encompassed in the Fujisaki model. The second movement is newly considered and applied by the fundamental frequency generation model of an embodiment of the present invention.

Although both the first and third movements involve the stretching of the vocal folds, the third is less effective than the first and supplementary to it. Therefore, the third movement is best considered in conjunction with the first movement.

Accordingly, an embodiment of the present invention assumes that the variation of the vocal folds stretch is affected by two forces: 1) a force (τ1) that causes the thyroid cartilage to rotate, by virtue of cricothyroid muscle contraction, toward the vocal folds stretch; and 2) a force (τ2) that causes the cricoid cartilage to rotate toward the vocal folds contraction.

Mathematical Expression of Movements in the Model

In order to obtain the mathematical expression of the fundamental frequency generation model, as described above, and to control the model using the parameters τ1 and τ2, a more simplified structure is substituted, which assumes that the muscles in the model are springs and that the thyroid cartilage is an object rotating at a fixed distance (FIG. 14).

In FIG. 14, τ1 denotes the force in the direction of stretch of the vocal folds in order to raise fundamental frequency; τ2 denotes the force in the direction of contraction of the vocal folds in order to lower fundamental frequency; θ denotes the angle of rotation of the thyroid cartilage; m and r denote the mass and the length of the thyroid cartilage, respectively; R denotes the resistance when the thyroid cartilage rotates; and k1 and k2 denote the spring constants of the vocal folds and the cricothyroid muscle, respectively, when modeled as springs. Although τ1 and τ2 vary with time, they are assumed here to be invariant with time in order to solve differential equation (2), below, in a simple manner.

In this model, considering the balance of forces in the direction of rotation, the following equation applies.

$$mr^2\ddot{\theta} = -R\dot{\theta} - (c_3 k_1 + c_4 k_2)\,\theta + (\tau_1 - \tau_2)$$

$$\therefore\quad mr^2\ddot{\theta} + R\dot{\theta} + K\theta = \tau \tag{2}$$

$$K = c_3 k_1 + c_4 k_2, \qquad \tau = \tau_1 - \tau_2,$$

where c₃ and c₄ are constants. The corresponding supplemental (homogeneous) equation is

$$mr^2\ddot{\theta} + R\dot{\theta} + K\theta = 0.$$

A set of fundamental solutions (θ1, θ2) for the supplemental equation and one particular solution (η) for equation (2) are next obtained, so that the solution of this second-order inhomogeneous linear differential equation is described as follows:

$$\theta(t) = c_5\theta_1 + c_6\theta_2 + \eta, \tag{3}$$

where c₅ and c₆ are constants.

The supplemental equation is then considered in order to obtain the fundamental solutions.

Next, assuming that the solution of this equation is the critically damped case, the fundamental solutions become:

$$\theta_1(t) = e^{-\beta t}, \qquad \theta_2(t) = t\,e^{-\beta t}, \qquad \beta = \frac{R}{2mr^2}. \tag{4}$$

If the particular solution for equation (2) is set as follows:

$$\eta(t) = At^2 + Bt + C,$$

the following are then derived:

$$\dot{\eta} = 2At + B, \qquad \ddot{\eta} = 2A.$$

Substituting these into equation (2) produces the following:

$$KAt^2 + (2RA + KB)\,t + 2mr^2 A + RB + KC - \tau = 0.$$

Since this holds for any t, the following are derived:

$$KA = 0, \qquad 2RA + KB = 0, \qquad 2mr^2 A + RB + KC - \tau = 0 \quad\therefore\quad A = 0, \quad B = 0, \quad C = \frac{\tau}{K}.$$

Therefore, the particular solution is described as follows:

$$\eta = \frac{\tau}{K}. \tag{5}$$

According to equations (3), (4), and (5), the general solution of equation (2) is described as follows:

$$\theta(t) = c_5\theta_1 + c_6\theta_2 + \eta = (c_5 + c_6 t)\,e^{-\beta t} + \frac{\tau}{K} \tag{6}$$

$$\therefore\quad \dot{\theta}(t) = \left\{-\beta c_5 + (1 - \beta t)\,c_6\right\} e^{-\beta t}. \tag{7}$$

From the initial conditions θ = θ₀ and θ̇ = θ̇₀ at t = 0, the following are derived:

$$\theta(0) = c_5 + \frac{\tau}{K} = \theta_0 \quad\therefore\quad c_5 = \theta_0 - \frac{\tau}{K}$$

$$\dot{\theta}(0) = -\beta c_5 + c_6 = \dot{\theta}_0 \quad\therefore\quad c_6 = \beta c_5 + \dot{\theta}_0 = \beta\theta_0 - \beta\frac{\tau}{K} + \dot{\theta}_0.$$

Substituting the derived c₅ and c₆ into equations (6) and (7) produces the following:

$$\theta(t) = \frac{\tau}{K}\left\{1 - (1 + \beta t)\,e^{-\beta t}\right\} + \left\{\dot{\theta}_0 t + (1 + \beta t)\,\theta_0\right\} e^{-\beta t} \tag{8}$$

$$\dot{\theta}(t) = \left\{-\beta\!\left(\theta_0 - \frac{\tau}{K}\right) + (1 - \beta t)\!\left(\beta\theta_0 - \beta\frac{\tau}{K} + \dot{\theta}_0\right)\right\} e^{-\beta t}. \tag{9}$$

If θ is a minute value, x = c₇θ may be derived, since x and θ may be regarded as proportional. Substituting this into equation (1), ln f₀ = c₁x + c₂, produces the following:

$$\ln f_0(t) = c_1 c_7\,\theta(t) + c_2 = c_8\,\theta(t) + c_2 \tag{10}$$

$$c_8 = c_1 c_7.$$

Although the value of c₈ cannot be obtained without analyzing sampling data, c₈ = 1 is assumed here for the simplicity of the equation.

Based on the above calculations, the following is determined:

$$\ln f_0(t) = \theta(t) + c_2 = \alpha\left\{1 - (1 + \beta t)\,e^{-\beta t}\right\} + \left\{\dot{\theta}_0 t + (1 + \beta t)\,\theta_0\right\} e^{-\beta t} + c_2, \qquad \alpha = \frac{\tau}{K}. \tag{11}$$

Therefore,

$$f_0(t) = f_{\mathrm{default}} \times \exp\!\left[\alpha\left\{1 - (1 + \beta t)\,e^{-\beta t}\right\} + \left\{\dot{\theta}_0 t + (1 + \beta t)\,\theta_0\right\} e^{-\beta t}\right] \tag{12}$$

$$f_{\mathrm{default}} = \exp(c_2), \tag{13}$$

where f_default is the fundamental frequency when the forces (τ1, τ2) corresponding to fundamental frequency changes are not present, and β is a constant that varies depending on the talker.

It is apparent from equation (12) that the angle of thyroid cartilage rotation and the stretch of the vocal folds can be calculated using the counter relation between a force (τ1) in the direction to raise the fundamental frequency and a force (τ2) in the direction to lower the fundamental frequency, and consequently fundamental frequency changes can be determined.
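For a constant command α, equation (12) can be evaluated directly. The following is a minimal sketch; f_default = 120 Hz is a hypothetical baseline, β = 20 echoes the talker-dependent value discussed later, and zero initial conditions are assumed for illustration.

```python
import numpy as np

def f0_closed_form(t, alpha, f_default=120.0, beta=20.0,
                   theta0=0.0, theta0_dot=0.0):
    """f0(t) from equation (12) for a constant command alpha = tau/K."""
    decay = np.exp(-beta * t)
    forced = alpha * (1.0 - (1.0 + beta * t) * decay)    # response to alpha
    transient = (theta0_dot * t + (1.0 + beta * t) * theta0) * decay
    return f_default * np.exp(forced + transient)

t = np.linspace(0.0, 0.5, 6)
print(f0_closed_form(t, alpha=0.4))  # contour rising toward f_default*exp(0.4)
```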

Implementation

Fundamental frequency can be calculated by inputting the parameter α (= (τ1−τ2)/K) into equation (12). However, to obtain the solution of equation (12) simply, τ1 and τ2 were assumed in equation (2) to be constants that are invariant with time. In reality, τ1 and τ2 vary with time, because muscles and bones move during utterances.

Assuming that time t takes on discrete values t = nΔt (n = 0, 1, 2, 3, ...) and that τ1, τ2, and α may take arbitrary values in every infinitesimal time Δt, equations (8), (9), and (12) may be rewritten, respectively, as follows:

$$\theta(n\Delta t) = \alpha(n\Delta t)\left\{1 - (1 + \beta\Delta t)\,e^{-\beta\Delta t}\right\} + \left\{\dot{\theta}((n-1)\Delta t)\,\Delta t + (1 + \beta\Delta t)\,\theta((n-1)\Delta t)\right\} e^{-\beta\Delta t} \tag{14}$$

$$\dot{\theta}(n\Delta t) = \left\{-\beta\left(\theta((n-1)\Delta t) - \alpha(n\Delta t)\right) + (1 - \beta\Delta t)\left(\beta\,\theta((n-1)\Delta t) - \beta\,\alpha(n\Delta t) + \dot{\theta}((n-1)\Delta t)\right)\right\} e^{-\beta\Delta t} \tag{15}$$

$$f_0(n\Delta t) = f_{\mathrm{default}} \times \exp\!\left[\alpha(n\Delta t)\left\{1 - (1 + \beta\Delta t)\,e^{-\beta\Delta t}\right\} + \left\{\dot{\theta}((n-1)\Delta t)\,\Delta t + (1 + \beta\Delta t)\,\theta((n-1)\Delta t)\right\} e^{-\beta\Delta t}\right] \tag{16}$$

Then θ(nΔt), θ̇(nΔt), and f0(nΔt) can be calculated from equations (14), (15), and (16), if θ((n−1)Δt), θ̇((n−1)Δt), and α(nΔt) are determined.

Consequently, f0(nΔt) (n = 0, 1, 2, 3, ...) for an arbitrary time can be calculated by supplying α(nΔt) (n = 0, 1, 2, 3, ...) as input.
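A minimal sketch of this recursion follows. The step size Δt, β, f_default, the zero initial conditions, and the example command sequence are assumptions for illustration; the model itself only requires the command sequence α(nΔt).

```python
import numpy as np

def generate_f0(alpha, dt=0.01, beta=20.0, f_default=120.0):
    """Return f0 at times n*dt for a time-varying command sequence alpha[n]."""
    decay = np.exp(-beta * dt)
    theta, theta_dot = 0.0, 0.0          # assumed initial conditions
    f0 = np.empty(len(alpha))
    for n, a in enumerate(alpha):
        new_theta = (a * (1.0 - (1.0 + beta * dt) * decay)
                     + (theta_dot * dt + (1.0 + beta * dt) * theta) * decay)  # eq. (14)
        theta_dot = (-beta * (theta - a)
                     + (1.0 - beta * dt) * (beta * theta - beta * a
                                            + theta_dot)) * decay             # eq. (15)
        theta = new_theta
        f0[n] = f_default * np.exp(theta)                                     # eq. (16)
    return f0

# An accent command for 0.2 s followed by a descent command for 0.2 s:
alpha = np.concatenate([np.full(20, 0.4), np.full(20, -0.3)])
print(generate_f0(alpha)[:5])
```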

Generating Method of Fundamental Frequency Patterns

The above description provides the mathematical expression of the model for an embodiment of the present invention and the inputs for generating patterns of time curves of fundamental frequency. The method to determine the input parameters will now be described.

The characteristic of accents in Tokyo dialect is that there is always a rise or fall of fundamental frequency from the first mora through the second mora, and a fall in fundamental frequency occurs exactly once in a word.

The “accent dictionary,” which is based on this rule, indicates the basic accent points of words. For example, the sentence “/ka re no imooto ga kekkon suru/”, which means “his sister is going to marry,” is exhibited with a dotted pitch pattern indicating where the fundamental frequency rises and falls in each syllable, as shown in FIG. 16.

In an embodiment of the present invention, the speech utterance is analyzed to examine the starting point and duration of each syllable, and the value of α is then determined by comparing the heights of the rectangular accent patterns of FIG. 16 with the extracted data (see FIG. 17). As shown in FIG. 17, the force τ1 is greater than the force τ2 in the regions where the value is positive, and the force τ2 is greater where the value is negative.

Tokyo dialect has the observed phenomenon of fundamental frequency falling moderately from the beginning through the end of a phrase. In order to obtain this overall change pattern of the curves, in an embodiment of the present invention, an approximate straight line of the fundamental frequency pattern extracted from the speech sampling data is derived using the least squares method (FIG. 18).
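A sketch of such a fit follows. It assumes, for illustration, that the line is fitted to the logarithm of the extracted fundamental frequency, consistent with the model's log-domain input; t_obs and f0_obs are synthetic stand-ins for real extraction results.

```python
import numpy as np

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 1.5, 30)        # frame times in seconds (hypothetical)
f0_obs = 130.0 * np.exp(-0.2 * t_obs) * (1.0 + 0.05 * rng.standard_normal(30))

# Least squares straight line through ln f0: the overall change pattern.
slope, intercept = np.polyfit(t_obs, np.log(f0_obs), deg=1)
declination = slope * t_obs + intercept
print(f"declination slope: {slope:.3f} per second")
```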

Fundamental frequency patterns, indicated by a solid line in FIG. 20, are obtained by inputting into the model, as the final input parameters, the sum (FIG. 19) of the value of τ1−τ2 (FIG. 17), obtained by the above method, and the value of the overall change (FIG. 18).

Approximation Results of Fundamental Frequency

Next described are the results of the best approximation of fundamental frequency extracted from speech utterance data containing various “prominence”, using the fundamental frequency generation model of an embodiment of the present invention.

Approximation Patterns Using Fundamental Frequency Generation Model

The speech utterance data used for analysis modeling for an embodiment of the present invention include 78 short sentences of 13 types pronounced by a male announcer and a female announcer. An assertive sentence and one to four kinds of sentences containing “prominence” were prepared for each of the short sentences.

The percentage of approximation errors between the fundamental frequency extracted from these speech utterances and the fundamental frequency generated using the fundamental frequency generation model of an embodiment of the present invention averages 5.8% for the male utterance data and 4.0% for the female utterance data. FIGS. 21 and 23 show the approximation data with the lowest precision of fundamental frequency for male and female utterances, respectively, and FIGS. 22 and 24 show those with the highest precision. The points marked by an “X” in FIGS. 22 and 24 represent the fundamental frequency extracted from the speech utterance sampling data, and the solid line represents the approximation of fundamental frequency. These graphs confirm that overall change patterns of fundamental frequency can be approximated successfully using the model of an embodiment of the present invention. Approximation errors in the results occur mainly because of the delay in fundamental frequency at start-up generated by the model, and the time lag between when the accent command and the descent command are given and the instant when the fundamental frequency is actually affected.

Since the fundamental frequency extracted from speech utterances contains errors that occur at extraction, it is expected that data without these errors will reduce approximation errors.

Factors Affecting Approximation Errors

Approximation errors appear to be caused at least in part by the time lag between when the accent command and the descent command of the fundamental frequency generation model are given and the instant when the changes actually start to occur. Even in a speech utterance by a human being, there is a slight time lag between when the accent and descent commands are provided and the instant when fundamental frequency is actually affected.

In the model of an embodiment of the present invention, this time lag is represented by the parameter β in equation (4). Since this parameter depends on the talker, an accurate value of β for the talker of particular recorded data can be obtained only by varying β and determining at which value the very best approximation of fundamental frequency is obtained.

The results of analysis using actual data with varied values of β confirm that the best approximation of fundamental frequency is obtained when β is greater than the expected value (β = 20), for which the above-mentioned time lag from the input parameters is shortest.

Effectiveness of Fundamental Frequency Generation Model

According to the calculation of approximation error in the fundamental frequency generation model of an embodiment of the present invention, the percentage of error is no more than 9%. Therefore, this model is determined to be accurate for generating fundamental frequency patterns of speech utterance data for assertive sentences and of speech utterance data containing “prominence”.

Accordingly, a more adequate approximation is available with this model than with the prior art with regard to the generation of fundamental frequency patterns for various kinds of speech utterance data.

Application to Dialects Other than Tokyo Dialect

As described above, the Tokyo dialect includes the phenomenon of fundamental frequency moderately descending from the beginning through the end of a phrase. In contrast, in Osaka dialect, this phenomenon cannot always be observed. Generating the fundamental frequency using the model of an embodiment of the present invention therefore requires parameters for the rhythm component corresponding to these tendencies in spoken sentences of the Osaka dialect.

The following characteristics of speech tendency are observed in overall Japanese speech utterances: 1) speech is pronounced by clause or by unit of intention; 2) between the units of utterance, there is a “re-start-up of fundamental frequency,” raising the fundamental frequency that has started falling; 3) a “re-start-up of fundamental frequency” can occur within the same breath group; and 4) Osaka dialect is spoken with a rhythm specific to a talker.

A sine wave is used to represent the above tendency simply, because a sine wave can approximate the “specific speech rhythm” and the “re-start-up of fundamental frequency” using only its amplitude and cycle. A single wavelength of the sine wave thus represents the duration from the instant when the speech utterance starts to the instant when the “re-start-up of fundamental frequency” occurs, as judged from the speech utterances.
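The sketch below shows how such a rhythm command might be formed and added to the model input; the function name, amplitude, and cycle are illustrative assumptions.

```python
import numpy as np

def rhythm_command(t, amplitude=0.15, cycle=1.2):
    """Sine-wave rhythm component: one wavelength per re-start-up interval."""
    return amplitude * np.sin(2.0 * np.pi * t / cycle)

t = np.arange(0.0, 1.2, 0.01)
alpha_accent_descent = np.zeros_like(t)   # accent/descent commands go here
alpha_total = alpha_accent_descent + rhythm_command(t)
```

Controlling only these two parameters, amplitude and cycle, is what makes the representation attractive: a talker's specific rhythm and the re-start-up both follow from a single waveform choice.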

Application Results

By using a sine wave to approximate the parameters of the rhythm component corresponding to this tendency of overall speech utterance, the fundamental frequency generation model of an embodiment of the present invention is applied to a speech utterance of /ee ojo-san ni narimashita na-/ (which means “what a nice girl she has grown up to be”) in Osaka dialect. The results using this parameter, which are presented in FIGS. 25A and 25B, clearly show the accent command and the descent command, respectively. FIG. 25D shows the fundamental frequency patterns (a solid line) generated by using the fundamental frequency generation model of an embodiment of the present invention. The broken line in FIG. 25D represents the fundamental frequency extracted from the speech utterance, and the sine wave representation of the rhythm component is shown with a dotted line. FIG. 25C shows, for comparison, the same data with parameters indicating the speech tendency derived using the least squares method instead of the sine wave representation of the rhythm component.

It is clear from FIGS. 25A-25D that a better approximation of fundamental frequency can be obtained by using the sine wave representation of the rhythm component. Further, the very short cycle of the sine wave also accurately approximates the characteristic rhythm specific to utterances in Osaka dialect. Consequently, the speech individuality of fundamental frequency can be analyzed and synthesized by controlling the cycle and amplitude of the sine wave.

Approximations of the fundamental frequency of other dialects or foreign languages can also be obtained by selecting the type of waveform pattern, cycle, and amplitude for the rhythm component. The analysis of the fundamental frequency of other dialects or foreign languages is thus also practicable.

Device Configuration Example

Based on the fundamental frequency generation model of an embodiment of the present invention, as described above, a sound source generation device and a speech synthesis device can be implemented. If speech is analyzed with consideration of the accent command and the descent command according to the fundamental frequency generation model of this embodiment, a more elaborate analysis can be performed.

FIG. 1A shows an overall configuration of the speech synthesis device of an embodiment of the present invention. While FIG. 1A shows a device for outputting speech sound according to a given character string, this configuration is also applicable to a device for outputting speech sound according to a given concept.

As shown in FIG. 1A, a character string (text) is inputted into the character string analyzing component 2. Upon receiving this character string input, the character string analyzing component 2 performs morphological analysis, referring to a word dictionary 4, and generates a phoneme symbol string. Further, the character string analyzing component 2 generates commands concerning prosody, such as an accent command, a descent command, and a control command of syllable duration for each syllable, referring to the word dictionary 4 and a dictionary of syllable duration 5. The phoneme symbol string is input into the filter coefficient control component 13, which is within the articulation component 12. This phoneme symbol string is also input into the calculation component for sound source generating parameters 8. The accent command, the descent command, and the control command of syllable duration are input into the calculation component for sound source generating parameters 8.

Using the phoneme symbol string, the calculation component for sound source generating parameters 8 determines whether each syllable or phoneme is a voiced or unvoiced sound by referring to a dictionary of voiced/unvoiced sounds of consonants/vowels 6. Moreover, this calculation component 8 also determines which syllables or phonemes can be changed to unvoiced sounds by using the unvoiced rules 7. Then, the component 8 determines the time curves of sound source amplitude by referring to a dictionary of amplitude for each phoneme or syllable 16, in accordance with the phoneme symbol string. Further, the calculation component for sound source generating parameters 8 calculates the time curves of fundamental frequency Fo using the control command of syllable duration, the accent command, the descent command, and the voiced/unvoiced distinction of consonants/vowels. This component 8 also calculates the time curves of voiced sound source amplitude Av and unvoiced sound source amplitude Af, in accordance with the control command of syllable duration, the voiced/unvoiced distinction of consonants/vowels, and the time curves of sound source amplitude.

A sound source generating component 10 generates and outputs a sound source waveform in accordance with the sound source generating parameters Fo, Av, and Af. This waveform is input into the articulating component 12.

The filter coefficient control component 13, which is within the articulating component 12, obtains the time curves of vocal tract transmission characteristic, which are generated in accordance with the phoneme symbol string produced by the character string analyzing component 2, by referring to the phoneme dictionary. Then, the filter coefficient control component 13 outputs filter coefficients, which implement the vocal tract transmission characteristics, to a speech synthesis filter component 15. Thus, the speech synthesis filter component 15 articulates the provided sound source waveform using the vocal tract transmission characteristics, in synchronization with each syllable or phoneme, and outputs a synthesized speech sound waveform. The synthesized speech sound waveform is then converted into analog sound signals by a sound signal output circuit (not shown).

FIG. 2 shows an embodiment of a hardware configuration for the device of FIG. 1, using a CPU. As shown in FIG. 2, connected to a bus line 30 are a CPU 18, a memory 20, a keyboard 22, a floppy disk drive (FDD) 24, a hard disk 26, and a sound card 28. Programs for character string analysis, calculation of sound source generating parameters, sound source data generation, and articulation are stored on the hard disk 26. These programs are installed from the floppy disk 32 using the FDD 24. A word dictionary 4, a dictionary of syllable duration 5, a dictionary of voiced/unvoiced sounds of consonants/vowels 6, a set of unvoiced rules 7, a dictionary of amplitude for each phoneme or syllable 16, and a phoneme dictionary 14 are also stored on the hard disk 26.

FIG. 3 is a flow chart showing the programs stored on the hard disk 26. As shown in FIG. 3, in step S1, a character string is inputted using the keyboard 22. Alternatively, a character string of data stored on the floppy disk 34 may be loaded.

Next, the CPU 18 performs morphological analysis of the character string by referring to the word dictionary (step S2). An example of this word dictionary is shown in FIG. 4A. Then, the CPU 18 obtains the pronunciation of the character string, referring to the word dictionary 4 and breaking the character string up into words. For example, when a character string input is made as “ko n ni chi wa”, a pronunciation of “koNnichiwa” is obtained. Furthermore, an accent value and a descent value of the syllables constituting the words are obtained for each word (step S3). Consequently, the syllables “ko” “N” “ni” “chi” “wa” and the accent and descent values for each syllable are obtained. Alternatively, the accent value and the descent value may be determined phoneme by phoneme. They are also determinable or correctable using rules based on the relationships among the preceding and succeeding sequences of phonemes or syllables. The relationships between all syllables and their durations, as shown in FIG. 4B, are stored in the dictionary of syllable duration 5 on the hard disk 26. In step S4, the CPU 18 obtains the syllable duration for each syllable of “ko” “N” “ni” “chi” “wa” given in step S2 by referring to the dictionary of syllable duration 5. Accordingly, a table for each syllable is generated, as shown in FIG. 4C.

As shown in FIG. 4D, all phonemes and their distinction of voiced/unvoiced sound are stored in the dictionary of voiced/unvoiced sounds of consonants/vowels 6 on the hard disk 26. In the index of phonemes in FIG. 4D, “V” denotes vowels (voiced sounds), “CU” denotes unvoiced consonants, and “CV” denotes voiced consonants. The CPU 18 makes a distinction between voiced and unvoiced sound for each phoneme of “k” “o” “N” “i” “c” “h” “i” “w” “a” by referring to the dictionary of voiced/unvoiced sounds of consonants/vowels 6. Furthermore, the CPU 18 determines which voiced sounds change to unvoiced sounds by referring to the unvoiced rules 7, which store data on the cases where voiced sounds change to unvoiced sounds. Thus, each phoneme is evaluated as to whether it is a voiced sound or an unvoiced sound (step S5).

Next, the fundamental frequency Fo (time curves) is generated according to the table in FIG. 4C, particularly the accent value and the descent value (step S6). Equation (12), described above, is used to perform this generation. The calculation is carried out with the accent value as τ1 and the descent value as τ2. The relation among the accent value, the descent value, and the fundamental frequency Fo is shown as a schematic diagram in FIG. 5. The portions where the fundamental frequency is not calculated indicate the unvoiced sound parts.

In this embodiment, the fundamental frequency Fo is determined by the accent value and the descent value, as illustrated by the sketch below.
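The following sketch maps a hypothetical per-syllable table in the style of FIG. 4C onto the model's input command; the syllable durations and the accent/descent values are invented for the example, and generate_f0 refers to the recursion sketched earlier.

```python
import numpy as np

syllables = [  # (syllable, duration in s, accent value tau1, descent value tau2)
    ("ko",  0.12, 0.5, 0.0),
    ("N",   0.10, 0.5, 0.0),
    ("ni",  0.12, 0.2, 0.4),
    ("chi", 0.14, 0.0, 0.4),
    ("wa",  0.16, 0.0, 0.2),
]
dt = 0.01
alpha = np.concatenate([
    np.full(int(round(dur / dt)), tau1 - tau2)   # command alpha per frame
    for _, dur, tau1, tau2 in syllables
])
# f0_curve = generate_f0(alpha, dt=dt)           # see the earlier recursion sketch
```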

The fundamental frequency Fo is thus calculated as described above. Next, the voiced sound source amplitude Av and the unvoiced sound source amplitude Af are calculated (step S7). In the dictionary of amplitude for each phoneme or syllable 16, the time curves of sound source amplitude corresponding to each syllable are stored, as shown in FIG. 4E. The CPU 18, referring to this dictionary, determines the voiced sound source amplitude Av and the unvoiced sound source amplitude Af for each syllable of “ko”, “N”, “ni”, “chi”, and “wa”. Since the voiced/unvoiced distinction is necessary, the sound source amplitude for voiced sound is calculated as Av and that for unvoiced sound as Af (FIG. 7).

Next, a sound source waveform is generated according to the fundamental frequency Fo, the voiced sound source amplitude Av, and the unvoiced sound source amplitude Af, as calculated above (step S8). This sound source generation process is shown in the schematic diagram in FIG. 8. The time curves of fundamental frequency Fo and the calculated time curves of voiced sound source amplitude Av are input into the voiced sound source generating component 40. Upon receiving these two, the voiced sound source generating component 40 generates a vocal folds sound source with voiced sound source amplitude Av, possessing the fundamental frequency Fo over time. The time curves of unvoiced sound source amplitude Af are input into the noise sound source generating component 42. Upon receiving this input, the noise sound source generating component 42 generates white noise having unvoiced sound source amplitude Af over time. Next, a summation component 44 composites the pulse waveform and the white noise synchronously. Thus, the sound source waveform is obtained.
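A minimal sketch of this process follows. It substitutes a unit impulse train where the description calls for a vocal folds sound source, which is a simplifying assumption; the sampling rate and curves are likewise illustrative.

```python
import numpy as np

def generate_source(f0, av, af, fs=16000):
    """Sum a pulse train at f0 scaled by Av with white noise scaled by Af."""
    n = len(f0)
    voiced = np.zeros(n)
    phase = 0.0
    for i in range(n):
        if f0[i] > 0.0:               # voiced region
            phase += f0[i] / fs
            if phase >= 1.0:          # emit one pulse per fundamental period
                voiced[i] = 1.0
                phase -= 1.0
    noise = np.random.randn(n)
    return av * voiced + af * noise   # the summation component 44

f0 = np.full(8000, 120.0)             # 0.5 s of 120 Hz voicing at 16 kHz
src = generate_source(f0, av=np.full(8000, 0.8), af=np.zeros(8000))
```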

Next, articulation reflecting the vocal tract transmission characteristic is applied to this sound source waveform (step S9), since the waveform so far corresponds to the sound source generated by the vocal folds and other vocal organs. The time curves of vocal tract transmission characteristic for each syllable are stored in the phoneme dictionary 14, as shown in FIG. 4F. The CPU 18 obtains the time curves of vocal tract transmission characteristic from the phoneme dictionary, in association with the phoneme symbol string (pronunciations or phonemes) from the morphological analysis in step S2. Then, the CPU performs the articulation by filtering the sound source waveform of step S8 according to the vocal tract transmission characteristic. In this articulation, the time periods of the sound source waveform and the vocal tract transmission characteristic must be synchronized. FIG. 10 shows the articulated synthesized speech sound waveform.
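The sketch below illustrates this filtering step. It assumes an all-pole filter per phoneme segment as a stand-in for the vocal tract transmission characteristics stored in the phoneme dictionary; the coefficients and segment boundaries are hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

def articulate(source, segments):
    """Filter each (start, end, a_coeffs) segment of the source waveform."""
    out = np.zeros_like(source)
    for start, end, a in segments:
        out[start:end] = lfilter([1.0], a, source[start:end])
    return out

source = np.random.randn(8000)                       # stand-in source waveform
segments = [(0, 8000, np.array([1.0, -1.3, 0.64]))]  # one segment, two poles
speech = articulate(source, segments)
```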

Furthermore, this synthesized speech sound waveform is input into the sound card 28. The sound card then converts the synthesized speech sound waveform into analog sound data and outputs speech sound through a speaker 29.

As described above, not only the accent value but also the descent value is applied for generating the fundamental frequency in an embodiment of the present invention. Thus, this embodiment allows the fundamental frequency to be controlled more precisely. For example, other local dialects can be expressed distinctively by changing the accent value and the descent value of the same word. A dictionary of accent values and descent values (e.g., a dictionary containing syllable sequence chains) for each dialect is utilized in an embodiment of the present invention. Alternatively, supplemental information data concerning the dialect may be added to a basic dictionary.

FIG. 1B shows an overall configuration of the speech synthesis device of another embodiment of the present invention. In this embodiment, a rhythm component generating component 17 is included, for outputting the rhythm command, which indicates the tendency of the fundamental frequency. The calculating component for sound source generating parameters 8 generates the fundamental frequency incorporating this rhythm command, as well as the accent command and the descent command.

For example, the fundamental frequency generally descends when the descending component indicated in FIG. 6A is used as the rhythm command. It is preferable to use this descending component as the rhythm command when synthesizing Tokyo dialect.

On the other hand, a sine wave, as indicated in FIG. 6B, is preferable for synthesizing Osaka dialect.

Thus, the speech synthesis device of this embodiment is applicable to various dialects and various languages by adopting the rhythm command and controlling its waveform, cycle, and amplitude.

While the embodiment described above focuses on a device for outputting speech sound corresponding to inputted characters, this invention may be applied to any type of device that generates a sound source by using a fundamental frequency. For example, the invention is applicable to a device that interprets provided language, generates a character string, accent values, and descent values, and calculates fundamental frequency. Furthermore, it is applicable to an artificial electronic larynx with a structure physically reproducing the vocal tract, in which a speaker generating a sound source is provided instead of vocal folds. In this case, the articulation of step S9 is not necessary.

In the prior art, a speech synthesis model was separately developed for each language group, such as stress-type languages like English or languages with four tones like Chinese, in which the accent characteristics differ. In contrast, according to the present invention, these languages, each having different accent characteristics, can be synthesized with one unified model.

While, in the above embodiment, software is used to provide the respective functions shown in FIG. 1A and FIG. 1B, part or all of the functions may be provided by hardware configurations.

Embodiments of the present invention have now been described in fulfillment of the above objects. It will be appreciated that these examples are merely illustrative of the invention. Many variations and modifications will be apparent to those skilled in the art.

Glossary

The term “accent command” refers to a command for raising fundamental frequency. In FIG. 14, τ1 corresponds to this command in the model. In the implementation pattern in FIG. 3, the accent value corresponds to it.

The term “descent command” refers to a command for lowering fundamental frequency. In FIG. 14, τ2 corresponds to this command in the model. In the implementation pattern in FIG. 3, the descent value corresponds to τ2.

The term “rhythm command” refers to the command indicating the tendency of fundamental frequency change, to which the simple descent in FIG. 6A and the sine wave in FIG. 6B correspond.

The term “command concerning prosody” refers to a command for producing sound source generating parameters, to which the syllable duration, accent value, and descent value in the implementation pattern in FIG. 3 correspond.

The term “command concerning phoneme” refers to a command used for articulation, to which the phoneme symbol string in the implementation pattern in FIG. 1 corresponds.

The term “sound source generating parameters” refers to the parameters required for generating a sound source, to which the fundamental frequency and sound source amplitude in the implementation pattern in FIG. 3 correspond.

What is claimed is:
1. A sound source generation device characterized in that the device comprises: calculating component for sound source generating parameters for outputting fundamental frequency at least as sound source generating parameters, upon receiving the command concerning prosody and according to the said command; and sound source generating component for generating sound source upon receiving sound source generating parameters from calculating component for sound source generating parameters and according to the said sound source generating parameters; wherein the generation of fundamental frequency is represented in the model with two forces: a force τ1 that causes the thyroid cartilage to rotate, by virtue of contraction of the cricothyroid muscle, toward the vocal folds stretch, and a force τ2 that causes the cricoid cartilage to rotate toward the vocal folds contraction; and both the accent command corresponding to the said force τ1 and the descent command corresponding to the said force τ2 are given for calculating fundamental frequency; and calculating component for sound source generating parameters calculates sound source generating parameters according to the accent command and the descent command.

2. A computer-readable storing medium for storing programs which are executable by using a computer, for executing any device or method of claim 1 by using a computer.

3. The sound source generation device of claim 1 characterized in that: the rhythm command indicating the tendency of fundamental frequency change is further given for calculating fundamental frequency, and calculating component for sound source generating parameters calculates sound source generating parameters according to the accent command, the descent command, and the rhythm command.

4. The sound source generation device of claim 3 characterized in that the said rhythm command is represented with a sine wave.

5. The sound source generation device of claim 4 characterized by controlling the characteristic of the generated sound source by means of controlling the amplitude and cycle of the said sine wave.

6. A speech synthesis device characterized in that the device comprises: character string analyzing means for analyzing a given character string and generating the command concerning phoneme and the command concerning prosody, calculating means for sound source generating parameters for outputting fundamental frequency as sound source generating parameters at least, upon receiving the command concerning prosody generated by character string analyzing means and according to the said command, sound source generating means for generating sound source, upon receiving sound source generating parameters from calculating means for sound source generating parameters and according to the said sound source generating parameters, and articulation means for articulating sound source from sound source generating means according to the command concerning phoneme from character string analyzing means, wherein the generation of fundamental frequency is represented in the model with two forces: a force τ1 that causes the thyroid cartilage to rotate, by virtue of contraction of the cricothyroid muscle, toward the vocal folds stretch, and a force τ2 that causes the cricoid cartilage to rotate toward the vocal folds contraction, and wherein character string analyzing means described above generates both the accent command corresponding to the said force τ1 and the descent command corresponding to the said force τ2 for calculating the fundamental frequency, and calculating means for sound source generating parameters described above calculates fundamental frequency according to the accent command and the descent command.

7. The speech synthesis device of claim 6 characterized in that: character string analyzing means further generates the rhythm command indicating the tendency of fundamental frequency change as the command concerning prosody, and calculating means for sound source generating parameters calculates fundamental frequency according to the accent command, the descent command, and the rhythm command.

8. The speech synthesis device of claim 7 characterized in that calculating means for sound source generating parameters generates the rhythm command as a sine wave.

9. The speech synthesis device of claim 8 characterized in that calculating means for sound source generating parameters controls the characteristic of the synthesized speech sound generated, by means of controlling the amplitude and cycle of the said sine wave.

10. A speech processing method using fundamental frequency as parameters at least, characterized by: modeling the generation of fundamental frequency with two forces: a force τ1 that causes the thyroid cartilage to rotate, by virtue of contraction of the cricothyroid muscle, toward the vocal folds stretch, and a force τ2 that causes the cricoid cartilage to rotate toward the vocal folds contraction, and adopting the accent command corresponding to the said force τ1 and the descent command corresponding to the said force τ2 for calculating fundamental frequency as elements for controlling the said fundamental frequency.

11. The speech processing method of claim 10 characterized by further adopting the rhythm command indicating the tendency of fundamental frequency change as elements for controlling fundamental frequency.

12. A speech analyzing method for analyzing the characteristic of speech sound, characterized by: modeling the generation of fundamental frequency with two forces: a force τ1 that causes the thyroid cartilage to rotate, by virtue of contraction of the cricothyroid muscle, toward the vocal folds stretch, and a force τ2 that causes the cricoid cartilage to rotate toward the vocal folds contraction, and performing analysis using the accent command corresponding to the said force τ1 and the descent command corresponding to the said force τ2 as elements for analyzing the fundamental frequency of the speech sound.

13. The speech analyzing method of claim 12 characterized by further adopting the rhythm command indicating the tendency of fundamental frequency change as elements for analyzing the said fundamental frequency.